Localized Centering: Reducing Hubness in Large-Sample Data

Authors

  • Kazuo Hara
  • Ikumi Suzuki
  • Masashi Shimbo
  • Kei Kobayashi
  • Kenji Fukumizu
  • Milos Radovanovic
Abstract

Hubness has recently been identified as a problematic phenomenon occurring in high-dimensional space. In this paper, we address a different type of hubness that occurs when the number of samples is large. We investigate the difference between hubness in high-dimensional data and hubness in large-sample data. One finding is that centering, which is known to reduce the former, does not work for the latter. We then propose a new hub-reduction method, called localized centering. It is an extension of centering, yet works effectively for both types of hubness. Using real-world datasets consisting of a large number of documents, we demonstrate that the proposed method improves the accuracy of k-nearest neighbor classification.
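The abstract gives no formulas, so the following is only a rough sketch of the ideas it names, not the authors' implementation. It assumes inner-product similarity on unit-length vectors, measures hubness as the skewness of the k-occurrence counts (how often each point appears in other points' k-nearest-neighbor lists), and compares plain centering with one assumed local variant that subtracts each point's similarity to the centroid of its own kappa nearest neighbors. All function names and the parameter `kappa` are hypothetical.

```python
import numpy as np

def k_occurrence(sim, k):
    """Count how often each point appears in the k-NN lists of the others.

    sim is an (n, n) similarity matrix: rows are queries, columns are
    candidate neighbors, and larger values mean more similar.
    """
    s = sim.astype(float).copy()
    np.fill_diagonal(s, -np.inf)          # a point is not its own neighbor
    knn = np.argsort(-s, axis=1)[:, :k]   # k most similar columns per row
    return np.bincount(knn.ravel(), minlength=s.shape[0])

def skewness(counts):
    """Skewness of the k-occurrence distribution (larger = more hubness)."""
    c = counts - counts.mean()
    std = c.std()
    return 0.0 if std == 0 else float((c ** 3).mean() / std ** 3)

def centered_similarity(X):
    """Inner-product similarity after subtracting the global centroid."""
    Xc = X - X.mean(axis=0)
    return Xc @ Xc.T

def localized_centered_similarity(X, kappa):
    """ASSUMPTION: a local variant of centering, not taken from the abstract.

    Entry (i, j) is <x_i, x_j> minus <x_j, c_j>, where c_j is the mean of
    the kappa nearest neighbors of x_j.
    """
    sim = X @ X.T
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)
    knn = np.argsort(-s, axis=1)[:, :kappa]
    local_centroids = X[knn].mean(axis=1)                  # shape (n, d)
    affinity = np.einsum("ij,ij->i", X, local_centroids)   # <x_j, c_j>
    return sim - affinity[None, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20)) + 1.0                   # synthetic data
    X /= np.linalg.norm(X, axis=1, keepdims=True)          # unit length
    for name, S in [("raw", X @ X.T),
                    ("centered", centered_similarity(X)),
                    ("localized", localized_centered_similarity(X, 20))]:
        print(name, skewness(k_occurrence(S, k=10)))
```

Running the script prints one skewness value per similarity measure on synthetic data; values closer to zero indicate fewer hubs. The synthetic data is only a smoke test and says nothing about the results reported in the paper.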

Similar resources

Class imbalance and the curse of minority hubs

Most machine learning tasks involve learning from high-dimensional data, which is often quite difficult to handle. Hubness is an aspect of the curse of dimensionality that was shown to be highly detrimental to k-nearest neighbor methods in high-dimensional feature spaces. Hubs, very frequent nearest neighbors, emerge as centers of influence within the data and often act as semantic singularitie...

Time-Series Classification in Many Intrinsic Dimensions

In the context of many data mining tasks, high dimensionality was shown to be able to pose significant problems, commonly referred to as different aspects of the curse of dimensionality. In this paper, we investigate in the time-series domain one aspect of the dimensionality curse called hubness, which refers to the tendency of some instances in a data set to become hubs by being included in un...

Local and global scaling reduce hubs in space

‘Hubness’ has recently been identified as a general problem of high dimensional data spaces, manifesting itself in the emergence of objects, so-called hubs, which tend to be among the k nearest neighbors of a large number of data items. As a consequence many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa...

Hubness in the Context of Feature Selection

Hubness is a property of vector-space data expressed by the tendency of some points (hubs) to be included in unexpectedly many k-nearest neighbor (k-NN) lists of other points in a data set, according to commonly used similarity/distance measures. Alternatively, hubness can be viewed as increased skewness of the distribution of node in-degrees in the k-NN digraph obtained from a data set. Hubnes...

An Improved Unsupervised Cluster based Hubness Technique for Outlier Detection in High dimensional data

Outlier detection in high dimensional data becomes an emerging technique in today’s research in the area of data mining. It tries to find entities that are considerably unrelated, unique and inconsistent with respect to the common data in an input database. It faces various challenges because of the increase of dimensionality. Hubness has recently been developed as an important concept and acts...

Publication date: 2015